Pipeline Architecture - ChartsMaze EDL Pipeline

The EDL Pipeline follows a strict six-phase architecture (plus an optional Phase 2.5) designed to orchestrate 16+ data-fetching and processing scripts in the correct dependency order. This page explains why order matters, how the phases depend on each other, and which configuration options are available.

Architecture Overview

The pipeline is orchestrated by run_full_pipeline.py, which executes scripts sequentially across six phases.

Phase Breakdown

PHASE 1: Core Data (Foundation)

Creates the foundational datasets that all other scripts depend on.

Scripts & Outputs

Script	Output	Purpose
`fetch_dhan_data.py`	`dhan_data_response.json` `master_isin_map.json`	Fetches 2,775 stocks and creates ISIN mapping
`fetch_fundamental_data.py`	`fundamental_data.json`	Quarterly results & financial ratios (35 MB)
NSE CSV Download	`nse_equity_list.csv`	Listing dates for all stocks

Critical Dependency: master_isin_map.json is used by ALL scripts in Phase 2, 2.5, and 4. If fetch_dhan_data.py fails, the pipeline cannot continue.
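A downstream script can guard against this failure mode by refusing to start when the map is missing. The following is a minimal sketch, not the pipeline's actual code: the helper `load_isin_map` and the exception `MissingIsinMapError` are hypothetical names, and the map is assumed to be a non-empty JSON document whose exact schema is not described on this page.

```python
import json
from pathlib import Path


class MissingIsinMapError(RuntimeError):
    """Raised when the Phase 1 ISIN map is absent or empty (hypothetical helper type)."""


def load_isin_map(path: Path):
    """Load master_isin_map.json, failing fast if Phase 1 did not produce it."""
    if not path.exists():
        raise MissingIsinMapError(f"{path} not found; run fetch_dhan_data.py first")
    with path.open(encoding="utf-8") as fh:
        data = json.load(fh)
    # Assumption: an empty document means Phase 1 failed part-way through.
    if not data:
        raise MissingIsinMapError(f"{path} is empty")
    return data
```

Failing at load time keeps the error close to its cause instead of surfacing later as missing symbols in Phase 2 outputs.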

PHASE 2: Data Enrichment (Fetching)

Parallel execution of 11 data fetching scripts, all consuming master_isin_map.json.

Scripts & Outputs

Script	Output	Description
`fetch_company_filings.py`	`company_filings/{SYMBOL}_filings.json`	Hybrid LODR + Legacy filings
`fetch_new_announcements.py`	`all_company_announcements.json`	Live corporate announcements
`fetch_advanced_indicators.py`	`advanced_indicator_data.json`	Pivot Points, EMA/SMA signals (8.3 MB)
`fetch_market_news.py`	`market_news/{SYMBOL}_news.json`	AI-sentiment news (50/stock)
`fetch_corporate_actions.py`	`upcoming_corporate_actions.json` `history_corporate_actions.json`	Dividends, Bonus, Splits (2 years history + 2 months ahead)
`fetch_surveillance_lists.py`	`nse_asm_list.json` `nse_gsm_list.json`	ASM/GSM surveillance lists
`fetch_circuit_stocks.py`	`upper_circuit_stocks.json` `lower_circuit_stocks.json`	Circuit breaker stocks
`fetch_bulk_block_deals.py`	`bulk_block_deals.json`	Bulk/Block deals (30 days)
`fetch_incremental_price_bands.py`	`incremental_price_bands.json`	Daily price band changes
`fetch_complete_price_bands.py`	`complete_price_bands.json`	All securities price bands
`fetch_all_indices.py`	`all_indices_list.json`	194 market indices

PHASE 2.5: OHLCV Data (Smart Incremental)

Optional phase controlled by the FETCH_OHLCV flag. Downloads the full OHLCV history on the first run and applies incremental updates on later runs.

Scripts & Performance

Script	Output	Performance
`fetch_all_ohlcv.py`	`ohlcv_data/{SYMBOL}.csv`	~2-5 min incremental, ~30 min first-time
`fetch_indices_ohlcv.py`	`ohlcv_data/indices/{INDEX}.csv`	High-speed specialized fetcher

Incremental Logic: If a CSV already exists for a symbol, only the missing dates are downloaded; otherwise the full history is fetched.
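The incremental decision can be sketched as a small date-window calculation. This is an illustration under stated assumptions, not the code in `fetch_all_ohlcv.py`: the function names are hypothetical, and the CSV is assumed to have a `date` column in ISO format (`YYYY-MM-DD`).

```python
import csv
from datetime import date, timedelta
from pathlib import Path
from typing import Optional, Tuple


def last_date_in_csv(csv_path: Path) -> Optional[date]:
    """Return the latest date in the CSV, or None if the file is missing or has no rows."""
    if not csv_path.exists():
        return None
    latest = None
    with csv_path.open(newline="", encoding="utf-8") as fh:
        for row in csv.DictReader(fh):
            day = date.fromisoformat(row["date"])  # assumed column name and format
            if latest is None or day > latest:
                latest = day
    return latest


def fetch_window(csv_path: Path, today: date, history_start: date) -> Optional[Tuple[date, date]]:
    """Decide which date range to download for one symbol."""
    latest = last_date_in_csv(csv_path)
    if latest is None:
        return (history_start, today)           # first run: full history
    if latest >= today:
        return None                             # already up to date
    return (latest + timedelta(days=1), today)  # incremental: missing dates only
```

Returning `None` for an up-to-date file lets the caller skip the network request entirely, which is where most of the incremental speed-up would come from.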

If FETCH_OHLCV = False, the following fields will be zero in the final output:

ADR (Average Daily Range)
RVOL (Relative Volume)
ATH & % from ATH
All turnover metrics
Post-earnings returns
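To illustrate why these fields fall back to zero, here is a sketch of two of the metrics computed from OHLCV rows. The formulas are common definitions of ADR and RVOL and are an assumption; this page does not state the exact formulas used by the pipeline. The function names, the row layout (`high`, `low`, `volume` keys), and the default window are likewise hypothetical.

```python
from typing import Dict, List

Row = Dict[str, float]  # assumed keys: "high", "low", "volume"


def adr_percent(rows: List[Row], window: int = 20) -> float:
    """Average Daily Range in percent over the last `window` rows; 0.0 without data."""
    recent = rows[-window:]
    if not recent:
        return 0.0
    # Assumes every low is positive, which holds for traded prices.
    mean_ratio = sum(r["high"] / r["low"] for r in recent) / len(recent)
    return 100.0 * (mean_ratio - 1.0)


def relative_volume(rows: List[Row], window: int = 20) -> float:
    """Latest volume divided by the mean of the preceding `window` volumes; 0.0 without data."""
    if len(rows) < 2:
        return 0.0
    baseline = rows[-(window + 1):-1]
    mean_volume = sum(r["volume"] for r in baseline) / len(baseline)
    return rows[-1]["volume"] / mean_volume if mean_volume else 0.0
```

With no OHLCV rows available, both functions return `0.0`, which mirrors the zero values described above.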

PHASE 3: Base Analysis (Building Master JSON)

Single critical script that produces the base structure of all_stocks_fundamental_analysis.json.

bulk_market_analyzer.py Details

Inputs:

fundamental_data.json (Phase 1)
dhan_data_response.json (Phase 1)
advanced_indicator_data.json (Phase 2)
nse_equity_list.csv (Phase 1)

Outputs:

all_stocks_fundamental_analysis.json (Base structure with ~60 fields)

Processing:

Loads fundamental data for all 2,775 stocks
Merges technical data from Dhan response
Adds advanced indicators (Pivots, SMA/EMA status)
Calculates QoQ/YoY growth metrics
Computes valuation ratios (P/E, PEG, ROE, ROCE, D/E)
Adds shareholding patterns (FII/DII changes, Free Float)
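The growth step above can be illustrated with the standard percentage-change formula. This is a sketch, not the code in `bulk_market_analyzer.py`: the function names are hypothetical, the input is assumed to be a list of quarterly values ordered oldest to newest, and returning `None` for a zero base is a design assumption.

```python
from typing import Optional, Sequence


def pct_change(current: float, base: float) -> Optional[float]:
    """Percentage change from base to current; None when the base is zero."""
    if base == 0:
        return None
    # abs() keeps the sign meaningful when the base period was a loss.
    return (current - base) / abs(base) * 100.0


def qoq_growth(quarters: Sequence[float]) -> Optional[float]:
    """Latest quarter versus the immediately preceding quarter."""
    if len(quarters) < 2:
        return None
    return pct_change(quarters[-1], quarters[-2])


def yoy_growth(quarters: Sequence[float]) -> Optional[float]:
    """Latest quarter versus the same quarter one year (four quarters) earlier."""
    if len(quarters) < 5:
        return None
    return pct_change(quarters[-1], quarters[-5])
```

For example, with quarterly sales of `[100, 110, 120, 130, 150]`, `yoy_growth` compares 150 against 100 and `qoq_growth` compares 150 against 130.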

This script MUST complete successfully before Phase 4. All Phase 4 scripts modify this JSON file in-place.

PHASE 4: Enrichment Injection (Order Matters!)

Six scripts that run sequentially to inject additional fields into all_stocks_fundamental_analysis.json.

CRITICAL: These scripts MUST run in this exact order. Each modifies the JSON file in-place.
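An in-place enrichment step can be sketched as load, merge, and atomic rewrite. This is an assumption-laden illustration rather than the pipeline's implementation: `inject_fields` is a hypothetical helper, and the master JSON is assumed to be a list of per-stock objects keyed by a `Symbol` field.

```python
import json
import os
import tempfile
from pathlib import Path
from typing import Dict


def inject_fields(master_path: Path, new_fields: Dict[str, dict], key: str = "Symbol") -> int:
    """Merge extra fields into matching records and rewrite the file in place."""
    with master_path.open(encoding="utf-8") as fh:
        records = json.load(fh)

    updated = 0
    for record in records:
        extra = new_fields.get(record.get(key))
        if extra:
            record.update(extra)
            updated += 1

    # Write to a temporary file in the same directory, then swap it in, so a
    # crash mid-write cannot leave a truncated master JSON for the next script.
    fd, tmp_name = tempfile.mkstemp(dir=str(master_path.parent), suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as fh:
        json.dump(records, fh)
    os.replace(tmp_name, master_path)
    return updated
```

Because every Phase 4 script reads and rewrites the same file, a step that runs out of order would either miss fields it expects or operate on stale values.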

Order & Dependencies

Order	Script	Fields Added	Dependencies
1	`advanced_metrics_processor.py`	ADR, RVOL, ATH, Turnover, Gap Up %, Day Range %	`ohlcv_data/`
2	`process_earnings_performance.py`	Quarterly Results Date, Returns since Earnings, Max Returns since Earnings	`company_filings/`, `ohlcv_data/`
3	`enrich_fno_data.py`	F&O Flag, Lot Size, Next Expiry	F&O data fetchers
4	`process_market_breadth.py`	Relative Strength Rating, Market Breadth metrics	Returns data from base analysis
5	`process_historical_market_breadth.py`	Historical breadth charts	OHLCV data
6	`add_corporate_events.py`	Event Markers, Recent Announcements, News Feed	ALL Phase 2 outputs

Dependency Chain Explained

1. Advanced Metrics First

Calculates volatility (ADR) and volume (RVOL) from OHLCV data
Must run before earnings processor (which needs price gap calculations)

2. Earnings Performance Second

Reads company filings to find quarterly results dates
Calculates returns from earnings date using OHLCV data
Requires base price data to be present

3. F&O Data Third

Independent enrichment, no dependencies on previous steps
Can technically run earlier, but positioned here for logical grouping

4. Market Breadth Fourth

Needs return calculations from base analysis
Computes relative strength vs. market indices

5. Corporate Events LAST

Aggregates data from ALL Phase 2 sources:

company_filings/ → Recent Announcements
market_news/ → News Feed
upcoming_corporate_actions.json → Event Markers
nse_asm_list.json → Surveillance flags
bulk_block_deals.json → Block deal markers
incremental_price_bands.json → Circuit revision markers

Must be final because it creates summary fields from all other data.
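The ordering rules described in this section can be expressed as data and checked before a run. The sketch below is hypothetical: `order_problems` and the constant names are not part of the pipeline, and only the constraints stated on this page are encoded (advanced metrics before earnings, corporate events last).

```python
from typing import List, Sequence

# Order taken from the Phase 4 table.
PHASE_4_ORDER = [
    "advanced_metrics_processor.py",
    "process_earnings_performance.py",
    "enrich_fno_data.py",
    "process_market_breadth.py",
    "process_historical_market_breadth.py",
    "add_corporate_events.py",
]
MUST_PRECEDE = [("advanced_metrics_processor.py", "process_earnings_performance.py")]
MUST_BE_LAST = "add_corporate_events.py"


def order_problems(order: Sequence[str]) -> List[str]:
    """Return human-readable violations of the Phase 4 ordering rules."""
    problems = []
    position = {name: i for i, name in enumerate(order)}
    for name in PHASE_4_ORDER:
        if name not in position:
            problems.append(f"missing script: {name}")
    for first, second in MUST_PRECEDE:
        if first in position and second in position and position[first] > position[second]:
            problems.append(f"{first} must run before {second}")
    if order and order[-1] != MUST_BE_LAST:
        problems.append(f"{MUST_BE_LAST} must run last")
    return problems
```

An empty result means the proposed order satisfies the documented constraints; `enrich_fno_data.py` is deliberately unconstrained because it has no dependencies on the earlier steps.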

PHASE 5: Compression

Compresses final outputs to .json.gz format with maximum compression.

Compression Details

Files Compressed:

all_stocks_fundamental_analysis.json → .json.gz (~80% smaller)
sector_analytics.json → .json.gz
market_breadth.csv → .json.gz

Typical Results:

Raw JSON: ~35-40 MB
Compressed: ~7-8 MB
Compression ratio: 80%+
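Maximum compression with Python's standard `gzip` module corresponds to `compresslevel=9`. The following sketch shows one way to compress a file and measure the space saved; `compress_file` is a hypothetical helper, and whether the pipeline uses this module or this level is an assumption.

```python
import gzip
import shutil
from pathlib import Path
from typing import Tuple


def compress_file(src: Path) -> Tuple[Path, float]:
    """Gzip `src` next to itself and return (output path, fraction of space saved)."""
    dst = src.with_name(src.name + ".gz")
    # compresslevel=9 is the slowest and smallest setting gzip offers.
    with src.open("rb") as fin, gzip.open(dst, "wb", compresslevel=9) as fout:
        shutil.copyfileobj(fin, fout)
    raw_size = src.stat().st_size
    saved = 1.0 - dst.stat().st_size / raw_size if raw_size else 0.0
    return dst, saved
```

A saved fraction of `0.8` corresponds to the "80% smaller" figure quoted above, for example 35 MB raw shrinking to 7 MB.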

PHASE 6: Optional Standalone Data

Controlled by the FETCH_OPTIONAL flag. Produces standalone datasets that are not included in the master JSON.

Optional Scripts

Script	Output	Description
`fetch_all_indices.py`	`all_indices_list.json`	194 market indices
`fetch_etf_data.py`	`etf_data_response.json`	361 ETF details

Note: These are standalone products and not consumed by the master pipeline.

Configuration Flags

Edit these flags inside run_full_pipeline.py (lines 60-71):

```python
# OHLCV: Auto-detect mode
# True = always fetch (incremental update: ~2-5 min if data exists, ~30 min first time)
# False = skip entirely (ADR, RVOL, ATH, % from ATH fields will be 0)
FETCH_OHLCV = True
```

Impact of Configuration

FETCH_OHLCV

Setting	Impact	Runtime	Output Fields
`True`	Full OHLCV download + incremental updates	+2-30 min	All 86 fields populated
`False`	Skip OHLCV entirely	Faster (~4 min total)	15+ fields will be zero

Zero Fields when False:

ADR (5/14/20/30 Days MA)
RVOL
ATH, % from ATH
Gap Up %, Day Range %
% from 52W Low
6 Month Returns
200 Days EMA Volume
Daily Rupee Turnover (20/50/100)
30 Days Average Rupee Volume
Returns since Earnings
Max Returns since Earnings

CLEANUP_INTERMEDIATE

Setting	Disk Usage After Pipeline	Files Kept
`True`	~50-100 MB	`.json.gz` + `ohlcv_data/` only
`False`	~200-300 MB	All intermediate JSONs + outputs

Deleted Files (when True):

master_isin_map.json
dhan_data_response.json
fundamental_data.json (35 MB)
advanced_indicator_data.json (8.3 MB)
all_company_announcements.json
company_filings/ directory
market_news/ directory
All corporate action JSONs
Raw all_stocks_fundamental_analysis.json (kept as .gz only)
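A cleanup step along these lines could look like the sketch below. It is illustrative only: `cleanup_intermediate` and the constant names are hypothetical, "All corporate action JSONs" is assumed to mean the two Phase 2 corporate action outputs, and the guard that keeps the raw master JSON until its `.gz` exists is an added safety assumption.

```python
import shutil
from pathlib import Path
from typing import List

MASTER_JSON = "all_stocks_fundamental_analysis.json"
INTERMEDIATE_FILES = [
    "master_isin_map.json",
    "dhan_data_response.json",
    "fundamental_data.json",
    "advanced_indicator_data.json",
    "all_company_announcements.json",
    "upcoming_corporate_actions.json",
    "history_corporate_actions.json",
    MASTER_JSON,
]
INTERMEDIATE_DIRS = ["company_filings", "market_news"]  # ohlcv_data/ is never touched


def cleanup_intermediate(root: Path) -> List[str]:
    """Delete intermediate artifacts under `root` and return what was removed."""
    removed = []
    for name in INTERMEDIATE_FILES:
        target = root / name
        # Keep the raw master JSON unless its compressed copy already exists.
        if name == MASTER_JSON and not (root / (name + ".gz")).is_file():
            continue
        if target.is_file():
            target.unlink()
            removed.append(name)
    for name in INTERMEDIATE_DIRS:
        target = root / name
        if target.is_dir():
            shutil.rmtree(target)
            removed.append(name + "/")
    return removed
```

Using explicit allow-lists rather than a glob pattern avoids deleting `ohlcv_data/`, which the incremental OHLCV logic depends on for fast subsequent runs.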

Error Handling Strategy

The pipeline implements a resilient continuation strategy:

Critical Failures

If fetch_dhan_data.py (Phase 1) or bulk_market_analyzer.py (Phase 3) fails, the pipeline stops immediately. These scripts produce the master ISIN map and the base JSON that all other scripts depend on.

Enrichment Failures

If any Phase 2 or Phase 4 script fails, the pipeline marks the script as failed and continues. This ensures you still get a usable output even if individual data sources are temporarily unavailable.

Final Report

At completion, the pipeline reports:

Total runtime
Successful scripts count
Failed scripts list
Output file size and compression ratio
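The stop-or-continue behavior described in this section can be sketched as a small runner loop. This is not the code in run_full_pipeline.py: `run_pipeline`, `run_script`, and the report keys are hypothetical, and each script is assumed to signal failure through a non-zero exit code. The output size and compression ratio parts of the report are omitted for brevity.

```python
import subprocess
import sys
import time
from typing import Callable, Dict, Iterable

# The two scripts this page names as critical.
CRITICAL_SCRIPTS = {"fetch_dhan_data.py", "bulk_market_analyzer.py"}


def run_script(script: str) -> bool:
    """Run one script with the current interpreter; True means exit code 0."""
    return subprocess.run([sys.executable, script]).returncode == 0


def run_pipeline(scripts: Iterable[str], runner: Callable[[str], bool] = run_script) -> Dict:
    """Run scripts in order, stopping only when a critical script fails."""
    started = time.monotonic()
    succeeded, failed, aborted = [], [], False
    for script in scripts:
        if runner(script):
            succeeded.append(script)
            continue
        failed.append(script)
        if script in CRITICAL_SCRIPTS:
            aborted = True  # nothing downstream can work without this output
            break
    return {
        "runtime_s": time.monotonic() - started,
        "succeeded": succeeded,
        "failed": failed,
        "aborted": aborted,
    }
```

Passing the runner as a parameter makes the loop testable without launching real scripts, for example `run_pipeline(order, runner=lambda s: s != "fetch_market_news.py")`.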

Performance Characteristics

Minimal Run

Configuration: FETCH_OHLCV = False
Runtime: ~4 minutes
Output: 60+ fields per stock (missing volume/volatility metrics)

Full Run

Configuration: FETCH_OHLCV = True (incremental)
Runtime: ~6-9 minutes
Output: All 86 fields per stock

First-Time Full Run

Configuration: FETCH_OHLCV = True (no existing data)
Runtime: ~35-40 minutes
Output: All 86 fields + complete OHLCV history

With Cleanup

Configuration: CLEANUP_INTERMEDIATE = True
Disk Saved: ~150-200 MB
Retained: Only .json.gz + ohlcv_data/

Next Steps

Data Flow

Understand how data transforms across phases

Output Schema

Explore the 86 fields in the final JSON

Quick Start

Run your first pipeline

Configuration

Customize pipeline behavior
